Sampling methods for time-series data
Context
In most studies, it is hard (and sometimes impossible) to analyse a whole population, so researchers use samples instead. In statistics, survey sampling is the process of drawing a sample from a population in order to conduct a survey. As data scientists, we usually work with data that was collected previously, so we don’t spend much time thinking about how the sampling was actually done. As we will see in this article, however, our data can carry different biases depending on how it was sampled, so you had better understand the implications of each of these sampling designs. There are many ways of drawing samples and, depending on the context, some are better than others.
Probability vs. non-probability
There are two broad categories of sampling designs: probability and non-probability. In probability sampling, each element of the population has a known, non-zero probability of being in the sample. This approach is usually preferable, since its properties, such as bias and sampling error, are usually known. In non-probability sampling, some elements of the population may have no chance of being selected, and there is a high risk that the sample is not representative of the population as a whole. However, probability sampling is not always possible, and sometimes it is simply cheaper to sample non-randomly.
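To make the distinction concrete, here is a minimal sketch in Python using only the standard library. The population, the sample size, and the function names `simple_random_sample` and `convenience_sample` are all assumptions made for illustration; they do not come from any particular dataset or library. The sketch contrasts a simple random sample, where every element has the same known inclusion probability n/N, with a convenience sample that just takes the first n elements, so most of the population has zero chance of being picked.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Probability sample: each element has the same known
    inclusion probability n / N (sampling without replacement)."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def convenience_sample(population, n):
    """Non-probability sample: take whatever is easiest to reach
    (here, simply the first n elements)."""
    return population[:n]


# Hypothetical population: 1,000 values ordered from smallest to largest,
# e.g. measurements that trend upward over time.
population = list(range(1000))
n = 100

# Known and non-zero for every element under simple random sampling.
inclusion_probability = n / len(population)

srs = simple_random_sample(population, n, seed=42)
conv = convenience_sample(population, n)

population_mean = sum(population) / len(population)  # 499.5
srs_mean = sum(srs) / n
conv_mean = sum(conv) / n                            # 49.5

print(f"inclusion probability (SRS): {inclusion_probability}")
print(f"population mean:             {population_mean}")
print(f"simple random sample mean:   {srs_mean}")
print(f"convenience sample mean:     {conv_mean}")
```

In this toy setup the convenience sample only ever sees the smallest values, so its mean sits far below the population mean, while the random sample's mean lands much closer. Real data will rarely be this extreme, but the mechanism behind the bias is the same.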
Let’s now take a look at some of the different sampling designs in each category and their pr